Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO

This paper proposes methods for Machine Learning (ML)-based Beam Alignment (BA), using low-complexity ML models, and achieves a small pilot overhead. We assume a single-user massive mmWave MIMO, Uplink, using a fully analog architecture. Assuming large-dimension codebooks of possible beam patterns at UE and BS, this data-driven and model-based approach aims to partially and blindly sound a small subset of beams from these codebooks. The proposed BA is blind (no CSI), based on Received Signal Energies (RSEs), and circumvents the need for exhaustively sounding all possible beams. A sub-sampled subset of beams is then used to train several ML models such as low-rank Matrix Factorization (MF), non-negative MF (NMF), and shallow Multi-Layer Perceptron (MLP). We provide an extensive mathematical description of these models and the algorithms for each of them. Our extensive numerical results show that, by sounding only 10% of the beams from the UE and BS codebooks, the proposed ML tools are able to accurately predict the non-sounded beams through multiple transmitted power regimes. This observation holds as the codebook sizes at UE and BS vary from 128×128 to 1024×1024.


Introduction
Driven by the explosive growth trend of large-scale connectivity and higher data rate systems, wireless data traffic is expected to exponentially increase, growing to 5 zettabytes per month and reaching a 100 Gps data rate by 2030 [1] Thus, the latency in the 6th Generation is predicted to reach 0.1 ms, representing 10% of 5G latency, in order to support new emerging technical needs, including holographic images, Internet of Things applications, and autonomous driving.
Beam Alignment is frequently defined in the literature as beam sounding, i.e., beam training.It illustrates a fundamental problem in millimeter-wave Multiple Input, Multiple Output systems, defined as the exchange of information between the user equipment UE and the base station BS in order to accurately select the optimal beam-steering direction.The process of aligning the beams is related to several technical problems, such as beam forming, beam sweeping, beam tracking, and beam selection.The whole framework that unites these operations between UE and BS is often denoted as the Beam Management.To fulfill the BA task, beam patterns stored in large codebooks are used at both UE and BS.In fact, pencil beams with directional gain are increasingly being used in several applications in order to alleviate the severe path-loss attenuation and increase capacity and data throughput.On the other hand, massive MIMO systems provide large gain in spectral and energy efficiencies compared with conventional MIMO systems.Using mmWave technology, these systems mainly offer a better communication quality by increasing the system bandwidth and reducing the effects of noise and interference.Due to the diversification of future 5G and towards 6G applications and intelligent systems, scientists predict the continuous generation of massive datasets for deep processing through large bandwidths, which introduces mmWave bands as the golden spectrum band candidates.However, the limitations of mmWave communication physical properties of the channel are crucial: scattering, attenuation, low coherence time related to the Doppler effect, penetration loss, environmental constraints, and complex channel modeling in realistic urban scenarios.The major problem we aim to encounter in this paper is the inevitable high signaling/training overhead.For this reason, the main trade-off is to browse the most accurate and the least complex ML algorithm that optimizes finding the optimal beam pair based on sounded instantaneous Received Signal Energies and using the minimum (possible) amount of training samples.
Contributions: In this current work, we propose ML-based BA methods, for a single user massive mmWave MIMO, Uplink, with a wide-band channel.We assume a single radio frequency chain with large codebooks of possible analog beams at BS (also known as BS codebook) and UE (also known as UE codebook).We define a beam pair as one beam from the BS and UE codebook.By approximating the SNR with the Receive Signal Energy (RSE), we bypass the need for CSI, i.e., a blind approach.We sub-sample large codebooks into smaller sub-sampled BS and UE codebooks, and sound the beam pairs from the sub-sampled codebooks to generate the training set-a novelty of the approach.Using the RSE of the sounded beam pairs (sub-sampled codebooks), we propose to train the following ML methods to predict the RSE of the beam pairs that were not sounded: Matrix Factorization (MF), non-negative Matrix Factorization (NMF), and feed-forward (shallow) Multi-Layer Perceptron (MLP).

•
We formulate the MF and NMF problems.We propose to use Block Coordinate Descent (BCD) and Block Gradient Descent (BGD) methods to solve each problem.We derive in depth all the update equations for these methods.We show that the BCD method converges to a stationary point from both MF and NMF problems.Our extensive numerical results show that, sub-sampling 10% of the BS/UE codebooks, the remaining RSE values can be predicted extremely well (with a training/test error ≈ 10 −6 ) for every antenna configuration.

•
We develop at length the equations of a general MLP model, the resulting loss function, and the corresponding optimization problem.In addition, we derive the equations of back-propagation for the MLP in question.Using extensive numerical results, we observe that sounding 10% of original codebooks is sufficient to predict the RSE of the beam pairs that were not sounded, with negligible training/test error.

•
We numerically compare the training/test losses of all the proposed models for a varying cardinality of codebooks and transmit powers.These results suggest that the BCD method for MF/NMF outperforms the MLP in terms of training and test error.Meanwhile, BCD for MF/NMF has a large computational complexity and the MLP exhibits medium complexity.• Interestingly, by sounding 10% of the BS/UE codebooks, the proposed ML models can predict the unknown RSE (beam pairs not sounded) with a negligible test error.Thus, the proposed methods achieve a 90% reduction in pilot signaling overhead, compared with the SotA benchmark, without any noticeable loss in performance.
Notations: Matrices and vectors are respectively written in boldface upper-case and lower-case letters.We use Tr for the trace, transpose, inverse, conjugate transpose, determinant, and Frobenius norm of a matrix A A A and the n × n identity matrix.
[A] i,j is used to denote the (i, j)th entry of a matrix A A A. We denote the Hadamard product by •, while [a a a] + := max(a a a, 0) illustrates a Euclidean projection of a a a on R D + and is applied element by element on a a a.We denote |x| the absolute value of x and [x] t as the entry t of a vector x x x.
Methods/Experiment: The proposed approach is data driven and model based.The dataset is generated following the Saleh Valenzuela wide-band mmWave system model.It is based on Received Signal Energies for each and every beam pair in the massive MIMO Uplink setup stored in separate .csvfiles.The model-based solution to the empirical risk minimization includes deriving a closed-form solution to the formulated non-convex optimization problem, stating the theoretical guarantees of convergence and empirically illustrating the success of the proposed partial and blind Beam Alignment procedure using different algorithms.All simulations are executed on Infres GPU servers and the Comelec laboratory PC at Télécom Paris, having the following characteristics: Intel(R) Core(TM) i5-8365U CPU @ 1.60 GHz, 16 Go (RAM), x64 processor, and 64-bit operating system under the license of Windows 10 Enterprise LTSC 2018, version 1809.The manufacturer is Dell and is located in Paris, France.All python packages used in this work (numpy, scipy, keras, pytorch, matplolib..) are related to python 3.9 release.In fact, the experimental protocol is based on offline grid-search cross-validation, which requires GPU processing for the selection of optimal hyperparameters and online training/prediction for Matrix Factorization, non-negative Matrix Factorization, and Multi-Layer Perceptron.The comparison is conducted following a Quality of Service-based approach, simulating a variety of MIMO configurations and architectural setups, investigating the impact of varying the Received Signal Energy regime and empirically stating intersections and differences in the impact of the transmit power on model behaviors, loss values, optimal signaling overhead ratio, and optimal hyperparameters.

•
Problem Statement: The main challenge addressed in this study is the high signaling overhead in Beam Alignment for mmWave MIMO systems, which hampers the efficient selection of optimal beam-steering directions.

•
Research Questions and Hypotheses: This study investigates whether machine learning methods can effectively reduce the signaling overhead required for accurate beam-pair prediction in mmWave MIMO systems.

•
Objectives and Aims: The primary objective is to develop and evaluate ML-based BA methods that minimize the training overhead while maintaining high accuracy in predicting the RSE for unsounded beam pairs.• Significance and Rationale: The study proposes a novel approach to BA using ML techniques, which can lead to a substantial reduction in pilot signaling overhead and enhance the efficiency of future wireless communication systems.

Literature Survey
In conventional standards, Exhaustive BA, also called Brute Force BA, is the de facto approach for the alignment process.It is based on sounding all available beams at both UE and BS codebooks in order to exhaustively select the optimal beam pair.One obvious drawback is the fact that the resulting signaling overhead scales as the product of the UE and BS codebook sizes.At 60 GHz, the Exhaustive BA has been adopted in several mmWave W LAN or WPAN communication technologies, e.g., IEEE 802.15.3c [2] and IEEE 802.11ad [3].It is conventionally applied in small MIMO configurations using small codebook sizes (e.g., codebooks of size 8 × 8 for LTE) and guarantees optimal performance.For cellular networks [4], V2X communications, Unmanned Aerial Vehicles, or High-Speed Train applications, the infeasibility of brute-force-based BA pushes scientists to reduce the large signaling overhead from using massive antennas systems.State-of-the-art methods can be divided into two categories: classic BA and learning-based BA.Traditional techniques tend to use a more and more structured Beam Alignment design such as hierarchical multi-level codebooks [5] (training beamforming vectors are constructed with different beam widths at different levels) and an overlapped beam pattern [6], where the main idea is to augment the amount of information carried by each channel measurement, reducing the required channel estimation time and beam coding [7], where we assign a unique code signature to each beam angle in addition to subspace estimation/decomposition-based BA [8].Compressed sensing-based algorithms [9] are also used in this context, taking advantage of channel sparsity.Therefore, we state two intersections in classic methods: they generally rely on CSI exchange and Exhaustive BA.In contrast, lately, Machine Learning (ML)-based BA has emerged and is continuously leading to some promising results.For instance, statistical models such as Kolmogorov model-based BA in [10] with sub-sampled codebooks reduce the signaling overhead: 15% of Exhaustive BA provides accurate predictions for optimal beams at UE and BS in a partial BA procedure, similar to our approach.Deep learning through shallow neural networks is increasingly used by Wireless Communication scientists, where we distinguish two major paradigms: first, the ML methods related to Supervised Learning (SL) via a Support Vector Machine and Multi-Layer Perceptrons for joint analog beam selection in [11], convolutional neural networks for beam management in sub-6 GHz in [12] and for calibrated beam training in [13], recurrent neural networks such as Long Short-Term Memory network for beam tracking in [14][15][16], auto-encoders for beam management in [17], and several other neural architectures, and second, Reinforcement Learning (RL) in [18][19][20], generally used to resolve the problems of Multi-Armed Bandit and Markov decision process.In addition, neural architectures have the ability to extract features from the hidden interactions between BS and UE, providing fast and accurate estimations through different MIMO setups and channel realizations, especially when applied to massive datasets where more and more data/train samples are embedded.This work is an extension of [21].In this paper, we extend the channel model to wide-band and we add multiple RF-chains at BS in a fully analog low-complexity architecture, where we investigate more ML tools for partial and blind BA.This paper is one of the first attempts to apply MF/N MF models and shallow Multi-Layer Perceptrons to a blind and partial Beam Alignment for massive mmWave SU-MIMO.Our work in [22] is related to the same approach and objectives, where we quantize the output of each RF-chain.

System Model
In this section, we illustrate the mmWave MIMO point-to-point system model.We consider an Uplink transmission from multiple-antenna user equipment UE using a single radio frequency chain and a multiple-antenna base station BS using multiple radio frequency chains.The proposed ML methods are performed at the BS, which has higher computational resources than UE. Figure 1a,b provide a diagram representation of the proposed architecture.UE and BS are respectively equipped with Uniform Linear Arrays of N T and N R antenna.We propose a low-cost/complexity fully analog architecture where UE has one radio frequency chain and BS has N r f radio frequency chains.UE selects its analog beamformer f f f u ∈ C N T from a codebook of feasible beam choices, u ∈ T , where T is the corresponding index set.Moreover, BS selects its analog combiner W W W i ∈ C N R ×N r f from a codebook i ∈ R with R as the index set of the codebook.We denote with C T the number of possible beamforming vectors at UE, i.e., the size/cardinality of the UE codebook, |T | = C T and C R , and the size/cardinality of the BS codebook, |R| = C R .Both beamforming and combining are fully performed in the analog domain using phase shifters at UE and BS; thus, they satisfy the following constant modulus constraints, ∀ r ∈ {1, . . ., N R }, ∀ t ∈ {1, . . ., N r f }: For our proposed approach, BS is responsible for receiving signal energies, denoted as RSE, in order to learn their patterns and features for the purpose of accurately predicting the optimal beam indexes from their corresponding codebooks and send them to UE.We adopt the wide-band channel model where Nc represents the number of sub-carriers over the whole bandwidth through an OFDM scenario, k is the index of the sub-carrier k, and H H H l ∈ C N R ×N T is the narrow band channel model representing the time domain channel impulse response with L-tapped delays given by where L is number of paths (rank) of the channel; θ (R) i and θ (T) i are the angles of arrival at BS and the angles of departure from UE, noting AoA/AoD to correspond to the i th path (and both assumed to be uniform over [−π/2, π/2]); ρ i is the complex gain of the i th path such that ρ i ∼ CN (0, 1), ∀i; and last but not least, a a a R (θ are the array response vectors at both UE and BS, respectively.We further assume that the channel is completely unknown to both UE and BS.Henceforth, in this paper, we shall denote the beam pair (u, i) as the combination of the UE beamformer indexed u from the UE codebook T and combiner indexed i in the BS codebook R. The signal at BS resulting from applying the beam pair (u, i), y y y u,i ∈ C N r f is expressed as where s u = 1 √ P u is the transmitted pilot symbol associated with f f f u (having power √ P u ) and n n n i = W W W H i n n n is the effective additive white Gaussian noise AWGN with unit variance (σ 2 = 1).We define the received Signal-to-Noise Ratio (SNR) for the beam pair (u, i) We assume a fully blind approach; i.e., neither BS nor UE has any knowledge of G G G. Thus, computing the above SNR expression is not feasible due to the fact that BS is assumed not to know G G G. Thus, in this work, we will approximate the SNR of the beam pair (u, i) using the corresponding instantaneous Received Signal Energies (RSEs) expressed as RSE u,i = ||y y y u,i || 2  2 , ∀(u, i) ∈ T × R .In other words, we will assume that RSE u,i ≈ SNR u,i for each beam pair (u, i) ∈ T × R.
Benchmark: Exhaustive BA: The de facto method for Beam Alignment is Exhaustive BA.It is accomplished by exhaustively sounding, jointly, the beams of both UE and BS codebooks, recording all entries of RSE, and exhaustively searching S S S for the indexes of the beam pair that maximize RSE at BS, i.e, (u ⋆ , i ⋆ ) = argmax (u,i)∈T ×R RSE u,i .Thus, the RSE matrix is computed/recorded N r f -entries, with each of pilot symbol, since N r f samples are simultaneously received at the BS for every pilot transmission (see Figure 2).Consequently, the pilot signaling overhead of the Exhaustive BA is which implies that the overhead of this benchmark scales poorly with the BS and UE codebooks.
Proposed partial Beam Alignment using sub-sampled codebooks: Recall the designation of the beam pair (u, i) as the beamforming vector of the index u in the UE codebook of beams and the combining vector of the index i in the BS codebook of beams.First, we select (at random) the indexes of the sub-sampled codebooks of beams at UE and BS, R S and T S , such that R S ⊂ R and T S ⊂ T , and |R S | ≪ |R| |T S | ≪ |T |.The idea behind this approach is to only sound beam pairs from the sub-sampled codebook of beams, R S and T S .We thus define the training set, K, as the sub-sampled codebook indexes at UE and BS, i.e., K := {(u, i) | (u, i) ∈ T S × R S }.Then, the RSE of the sounded beam pairs (training set) is given to several ML methods, and the learned ML model is used to predict the RSE of non-sounded beam pairs.We formalize this proposed method below.We express both the received signal y y y (u,i) and RSE for the beam pair (u, i) resulting from the sounded beam pairs (i.e., training set), as follows: The dataset is formulated using the following incomplete RSE matrix, where [S S S] u,i denotes the element (u, i) of S S S, ∀(u, i) ∈ T × R. Evidently, the value of RSE is undefined for the beam pairs that were not sounded, designated as unknown-RSE matrix coefficient.Those are the missing entries, which are predicted using one of the following proposed ML methods: (i) low-rank MF/NMF and (ii) shallow (feed-forward) MLP, where we utilize the sounded RSE entries as the training set, K. Then the training set, K, is fed into one of the above ML models, which will predict the RSE of non-sounded coefficients in S S S, denoted as 'Unknown', in (5) (see Figure 3).Finally, the pilot signaling overhead for the above-proposed sub-sampled codebook method is Ω = |T S × R S | / N r f = |K| / N r f .We split the RSE dataset into a training set K and a test set L such that K ∩ L = {}.In this paper, RSE u,i denotes the true value (label) of the RSE for the beam pair (u, i) in the training set K, and RSE u,i denotes the true value (label) of the RSE for the beam pair (u, i) in the test set L.
Signaling overhead ratio: It is defined as η := overhead of learning-based BA overhead of Exhaustive BA where T S and R S are, respectively, the sizes of the UE and BS sub-sampled codebooks used in our proposed partial beam sounding, while T and R refer to the original size of the codebooks, and 0 < η ≤ 1 measures the signaling overhead of all the proposed MF, MLP, and AE methods compared with that of Exhaustive BA.Evidently, a small value for η is desired to reduce the signaling overhead of our proposed method.However, a low η implies that the size of the training set is small.As a result, the proposed ML method will not be able to extract enough data patterns due to the (too) small number of training samples, resulting in a larger prediction error.As one of the contributions of this work, we will (empirically) find as small a value for η as possible while still having extremely small training and prediction error.Conjecture: Note that, from the equations of the narrow-band channel model H H H and the wide-band channel model G G G(k), it is simple to verify that rank(H H H) ≤ L and rank(G G G(k)) ≤ LN C .Assume that P u → ∞.Thus, we can approximate the RSE matrix as If P u → ∞, then it can be shown that the RSE matrix S S S is such that rank(S S S) ≤ LN C .This implies While the proof for this necessary condition eludes the authors, we empirically observed that if P u is large, then the number of non-zero singular values of S S S, {σ i (S S S)} rank(S S S) i=1 , satisfies the above upper bound, i.e., | {σ i (S S S)} rank(S S S) i=1 | ≤ LN C .

Remark 1.
Recall the expression for the effective rate, r, r = (1 where Ω is the pilot signaling overhead and T is the number of symbols per block.Thus, the problem of maximizing r is written as the following series of equivalent problems: where the last ⇔ is due to the fact that the log(x) is a strictly monotonically increasing function in x.This result implies finding the optimal beam pair (u ⋆ , i ⋆ ) that maximizes r is equivalent to finding the best beam pair that maximizes the RSE.
Remark 2. The information (number of entries) needed to represent the RSE matrix S S S ∈ C C R ×C T is measured as rank(S S S)(1 + C T + C R ).This result is evident from performing the SVD on S S S and counting the resulting number of entries.Thus, if S S S is severely rank deficient, i.e., extremely compressible, then methods such as MF/N MF will exhibit extremely small training and test error.Conversely, if S S S is full rank, i.e., not compressible, then the training and test of MF/N MF will be quite large.

MF and NMF Problem Formulation
The intuition behind low-rank MF is to model the RSE of the sounded beam pairs (i.e., entries of S S S that are known as T S × R S ) as an inner product between two D-dimensional latent vectors/factors, θ θ θ u , ψ ψ ψ i , as illustrated in Figure 4. Specifically, the RSE of the beam pair (u, i), denoted as [S S S] u,i , is modeled as , where D is the size/dimension/complexity of the Matrix Factorization model latent factors and θ θ θ u ∈ R D , ψ ψ ψ i ∈ R D are the MF model parameters (to be optimized).In addition, due to the low-rank MF model, D is assumed to be much smaller than the dimensions of S S S, i.e., D ≪ (C T , C R ).The RSE of the beam pair (u, i) is known from sounding the sub-sampled codebooks (i.e., label).The general formulation of our loss function ℓ u,i describes the distance between the true value RSE u,i and the predicted value θ θ θ T u ψ ψ ψ i , which corresponds to the MF output/prediction: ℓ u,i := (RSE u,i − θ θ θ T u ψ ψ ψ i ) 2 , ∀ (u, i) ∈ K(:= T S × R S ).The Empirical Risk (also known as training error) is defined as the average across all the individual loss function ℓ u,i .We define the regularized Empirical Risk function as the above empirical risk in addition to the following regularization terms: where For the Matrix Factorization variant N MF, the optimization problem is given by where { θ θ θ u , ψ ψ ψ i } denotes the optimal latent vectors for MF and NMF.The test loss (also knows as test error) is given by applying the general loss on the unknown data samples (non-sounded beams) using optimal MF/NMF parameters θ θ θ u and ψ ψ ψ i : , where L is the test set of our learning model.

Solutions for MF
We resolve the MF problem (P1) using the following methods: (i) Block Coordinate Descent (BCD) often denoted as Alternating Least Squares (ALSs), (ii) BCD with Stochastic Gradient Descent, and (iii) Block Gradient Descent (BGD), which merges BCD and Gradient Descent (GD) definitions.
BCD for MF (BCD MF): BCD proceeds by splitting the optimizing problem (P1) into sub-problems, supposing that all other blocks are known/fixed.We will show that each sub-problem is strongly convex in each block, and the BCD algorithm converges to a stationary point.The application of BCD to the MF problem results in two sub-problems, S1 and S2, which are solved iteratively.At iteration k, the sub-problem (S1) is defined by fixing the block {ψ ψ ψ (k) i } ∀i and the update/solve block {θ θ θ u } ∀u only,as follows: Moreover, the sub-problem (S2) is defined by fixing the block {θ θ θ (k+1) u } ∀u in (P 1 ) and the update/solve block {ψ ψ ψ i } ∀i , only, as follows: We will rewrite S1 into as series of equivalent problems as follows: where U i is the set of row indexes u in the RSE matrix corresponding to the column i in the known entries of the RSE matrix, ) and r r r i ).We derive the closed-form solution for the sub-problem S1 by finding the global min of f 1 (θ θ θ u ), as follows: Similarly, we rewrite the sub-problem (S2) into the following series of equivalent problems by stating the last one: where I u is the set of column indexes i in the RSE matrix corresponding to the row u in the known entries of the RSE matrix, t t t ) and P P P ). Next, we derive a closed-form solution for the sub-problem S2 by finding the global min of f 2 (ψ ψ ψ i ), as follows: Thus, BCD updates to solve MF are given as follows: where (k) represents the index of the BCD iterations, (u,i) are the codebook indexes at UE and BS, and [S S S] u,i denotes the RSE of the (u,i) beam couple.The solution { θ θ θ u , ψ ψ ψ i } (u,i)∈K is reached after the interval/gap between consecutive iterations reaches a predefined ϵ or a max number of iterations, I M .We have the following result.
Corollary 1.The sequence of updates {θ θ θ , is non-increasing (in k) and converges to a stationary point as k → ∞.

Block Stochastic Gradient Descent (BSGD) for MF (SGD MF):
SGD MF proceeds by taking T plain SGD steps (mini-batch size = 1).BGD proceeds by taking T SGD steps for each block BCD.We first choose at random a single training sample (u, i) ∈ K.The BSGD update for the sub-problem (S1) is done by performing SGD for f 1 (θ u ) = ∑ u∈U i h u (θ u ), i.e., choosing at random a single index u ∈ U i and computing the plain SGD ∇ f 1 (θ u ) = ∇ ∑ u∈U i h u (θ u ) = h u (θ u ), where u is a random index from U i , and ∇ f 1 (θ u ) is the plain SGD on f 1 ().The corresponding update is given as (k) is the iteration index for SGD, and ∇ f 1 (θ θ θ u ) is the plain SGD over one random sample u ∈ U i .Similarly, the update for the sub-problem (S2) is done by taking T plain SGD steps of f 2 (ψ ψ ψ) = ∑ i∈I u h i (ψ ψ ψ i ), i.e., the SGD, ∇ f 2 (ψ ψ ψ i ) = ∇(∑ i∈I u h i (ψ ψ ψ i )) = h i (ψ ψ ψ i ), where i is single random index from I u .Thus, the SGD MF update for the sub-problem (S2) is expressed as where i is a single index chosen randomly from I u , t t t and ∇ f 2 (ψ ψ ψ i ) is the plain SGD gradient computed with one sample i ∈ I u , chosen at random.We write the SGD MF updates as where u is a random index chosen from U i , and i a random index from I u .0 ≤ α k ≤ 1 is the step size for SGD.

BGD for MF (BGD MF):
Rather than having a closed-form solution for each optimization block, BGD proceeds by taking T gradient steps for each block T gradient step.We skip the details here for space limitations.Thus, the BGD updates for the MF problem are expressed as where (u,i) are the codebook indexes at UE and BS, k is the GD iteration index, and α (k) is the BGD step size (0 < α (k) < 1).

Solutions for NMF
Our proposed N MF follows the exact steps as in MF, with the main difference of constraining the latent vectors being non-negative θ θ θ u ∈ R D + , ψ ψ ψ i ∈ R D + , ∀ (u, i) ∈ K. Likewise, we solve the N MF problem, (P2), using BCD, SGD, and BGD.
BCD for NMF (BCD NMF): The derivations of BCD for N MF (11) are identical to those of BCD for MF (8), followed by the corresponding projection operation.The updates of BCD for N MF derivations are given by where (k) is the BCD iteration index, and [a a a] + := max(a a a, 0) is applied element by element on a a a, i.e., a Euclidean projection of a a a on R D + .Since the projection is Euclidean (non-expansive operator), the corollary stated in the previous subsection applies to the BCD for N MF too.
Block Stochastic Gradient Descent (BSGD) for NMF (SGD NMF): The SGD NMF derivations are exactly the same as that of SGD MF, followed by a projection [] + .We thus express the SGD NMF updates as where u is a random index chosen from U i , i is a random index from I u , [a a a] + := max(a a a, 0), and α (k) is the SGD step size (0 < α (k) < 1).

BGD for NMF (BGD NMF):
The solution and derivations for BGD NMF are the same as those for BGD MF, followed by a projection [] + , i.e, where [a a a] + := max(a a a, 0), (k) is the GD iteration index and α (k) is the GD step size (0 < α (k) < 1).We use a constant step size α k = α for all these methods.

Prediction for MF and NMF
For both MF and N MF, the predicted RSE of the beam-pair (u, i), for beam indexes that were not sounded, is expressed as where L is the test set and { θ θ θ u ) T , ψ ψ ψ i } are optimal solutions to MF (or NMF).Afterwards, we search for the optimal beam pair at UE and BS as the one with the highest RSE value over both training and test sets, as follows:

Proposed BA Algorithm Using MF/NMF
Due to the fact that the updates given in a closed-form solution, we can quantify the computational complexity of all of the above methods.As seen from the updates for BCD MF and BCD NMF, we have to invert two D × D matrices (for the sum problems S1 and S2).Thus, the (per-iteration) computational complexity of BCD MF and BCD NMF is approximated as C BCD MF = C BCD N MF = O(2D 3 ).Moreover, for BGD MF and BGD NMF, one has to compute two full-batch gradients over all training samples in K (for the subproblems S1 and S2).Consequently, the complexity, per-iteration, for BGD MF and BGD NMF is approximated as C BGD MF = C BGD N MF = O(2|K|).Finally, for SGD MF and SGD NMF, since we use a mini-batch size = 1 (for the sub-problems S1 and S2), the resulting per-iteration computational complexity is approximated as Solving the MF and N MF problem, we employ methods such as BCD, BGD, or SGD.All details are shown in Algorithm 1.
-Record corresponding RSE in and generate mat.S S S, in (5) -Select model: MF or NMF -IF MF model selected solve (P1) with BCD for MF, in (8) or solve (P1) with BGD for MF, in (10) or solve (P1) with SGD for MF, in (9).At the end of training, return optimal latent vectors, { θ θ θ u , ψ ψ ψ i } (u,i)∈K -IF NMF model selected solve (P2) with BCD for NMF, in (11) or solve (P2) with BGD for NMF, in (13) or solve (P2) with SGD for NMF, in (12).At the end of training, return ideal latent vectors, -Search training and test sets, for beam pair w/ largest RSE, (15) Output: While, for MF BCD and NMF BCD, the only hyperparameter is the model size D, MF BGD and NMF BGD require, in addition to D, α k , the GD step size as hyperparameters.
We propose to investigating six models in total (BCD MF, BCD NMF, BGD MF, BGD NMF, SGD MF, SGD NMF) with respect to three transmitted power regimes: high Pu = 1W, medium Pu = 10 −1 W, and low Pu = 10 −2 W with fixed σ 2 = 1.In Table 1, we provide a summary for all proposed system parameters.We use the training Normalized MSE (N MSE) to evaluate the training error, expressed as Train NMSE = ).The range of training error and the overall behavior of BCD-based models are different and distinctive from GD models in both MF and N MF; for instance, BGD-based models' error range are around ×10 −7 , while BCD-based models are around ×10 −4 .Thus, GD is more accurate.However, BCD converges faster and the cost function drops to low values from the very first iterations.In addition, for MF and N MF, the train N MSE decreases with the increase in the overhead ratio η, as seen in Figure 5. Low and medium P u regimes are characterized by noisy links between UE and BS and represent a more challenging experimental environment.BCD-based models tend to be faster in reaching low error values, while BGD-based models are more accurate.(For instance, BSGD generally ameliorates the quality of prediction compared with BGD).Regarding MF/N MF simulation figures, Figure 5a states the decrease of train/test N MSE in function of the overhead ratio (more training samples result in fewer errors); Figure 5b,c track the instant drop in loss values from the very first iterations for BCD-based models; and Figure 5d,e present the progressive convergence of cost function among the iterations when we use BGD-based models.In summary, Table 2 outlines the optimal (minimum) signaling overhead ratio required for the all proposed system configurations, the optimal model (holding the smallest total cost function), the related combination of optimal hyperparameters, and the corresponding train/test error values.When the signal is affected with much noise, it is harder to keep the same range of error when compared with high a P u regime.In fact, MF models keep the same (minimum) signaling overhead (0.1) regardless of the transmitted power regime, being able to accurately predict with just 10% of sounded beams.Thus, the proposed MF/N MF methods are able to reduce the pilot signaling overhead by 90% compared with Exhaustive BA with negligible training and test errors.

MLP Problem Formulation
We consider a feed-forward MLP, with J layers, modeled as a composition of J nonlinear functions/layers.Let z 0 ∈ R be the MLP input, and z J ∈ R be the MLP output; see Figure 6.We denote with {z z z 2 , . . ., z z z J−1 } all the hidden layers.We assume for simplicity that the width of all the layers is the same, denoted as D, i.e., {z z z 2 ∈ R D , . . ., z z z J−1 ∈ R D }; see Figure 6.The equation describing layer 1 is given by z z z 1 = σ 1 (ϕ ϕ ϕ 1 z 0 ) = σ 1 (ϕ ϕ ϕ 1 1), where z z z 1 ∈ R D is the output of layer 1, ϕ ϕ ϕ 1 ∈ R D is the resulting weight vector, and σ 1 () : R −→ R D is the non-linear activation function for layer 1.We use one hot encoding for the MLP input z 0 ∈ R, i.e., z 0 = 1 for all training samples, ∀(u, i) ∈ K.We express the output of the hidden layers, {z z z j ∈ R D } J−1 j=2 , as z z z j = σ j (Φ Φ Φ j z z z j−1 ) , ∀ j ∈ {2, . . ., J − 1}, where z z z j−1 ∈ R D is the input of the layer j and z z z j ∈ R D is its output , ∀ j ∈ {2, . . ., J − 1}; Φ Φ Φ j ∈ R D×D is the weight matrix for the layer j , ∀ j ∈ {2, . . ., J − 1}; and σ j−1 () : R D −→ R D is the element-by-element non-linear activation function for the layer j, ∀ j ∈ {2, . . ., J − 1}.Finally, the relation for the last layer j = J is expressed as z J = σ J (ϕ ϕ ϕ J z z z J−1 ), where z J ∈ R is the output for layer J, ϕ ϕ ϕ J ∈ R 1×D is its weight vector, and σ J () : R D −→ R is the non-linear activation function for the layer J.We express the output of the MLP z J ∈ R (as a function of all layers) as The output of MLP is made to fit/approximate all the RSE values at all training samples; z J := RSE u,i , ∀(u, i) ∈ K.We define the MSE loss l u,i for the sample (u, i) in the training set K as the distance between the MLP output z J and the known RSE label for the beam pair (u, i), RSE u,i , i.e, Then, the empirical risk is defined as the average of the individual loss l u,i across the training set K, (1/|K|) ∑ (u,i)∈K l u,i .The empirical risk minimization for the MLP is given in (P3).

MLP Learning
We propose to learn the optimal MLP weights via back-propagation (BP).We choose an arbitrary mini-batch of samples of size B ⊆ K and define the mini-batch loss as We express the partial derivative of the loss corresponding to the mini-batch l B with respect to each layer Φ Φ Φ j , j{1, . . ., J} as

Similarities and Differences between Models
All models required just 10% of the beams for training for all the proposed massive setups.Moreover, all the proposed models are shallow neural architectures with few hidden layers for low-complexity constraints.Even among the largest configurations, the optimal dimensions of models picked from the cross-validation illustrate small networks with no need to require dense architectures.Furthermore, all models succeeded with the matrix completion task, and they all illustrate a monotonic decrease in loss values as long as we increase the MIMO setup.Additionally, MF-based models are the most accurate reaching loss values in the range 10 −8 for massive setups in a high P u regime, and their crossvalidation illustrates smaller grid search where there are fewer hyperparameters to tune.However, they are the slowest models when applied to high-dimensional MIMO setups.On the other hand, MLP illustrates a good balance between run time (complexity) and loss values (prediction quality).It reaches around 10 −4 and 10 −5 loss for massive configurations.In addition, the MLP is the most robust model facing the changes in the P u values.In Figure 8, for 512 × 512, the figure illustrates the train/test N MSE in the function of each model and the corresponding transmitted power: in Figure 8a, for P u = 1W, MF achieves its best performance, slightly better than MLP with the difference between achieved cost values at around 10 −1 .In Figure 8b, when P u = 10 −1 W, MF still gets the best performance, marginally better than MLP with an N MSE value difference of around 10 −1 .In Figure 8c, when P u = 10 −2 , MF noticeably gets impacted (overall loss around 10 −3 ) while MLP provides the best prediction performance: this suggests that when P u is small, MLP is more robust than MF/N MF, which performs best in high P u regime.Similarly, almost same remarks hold for Figure 9 when we simulate the 128 × 128 configuration: in Figure 9a, MF reaches considerably better performance compared with MLP with 10 −4 .In Figure 9b, MLP kept the same range of error, which states again the robustness of the model while MF got severely impacted (10 −3 ) but sill holds the best performance.In Figure 9c, when P u is weak, MF illustrates the worst performance in all simulations.On the other hand, MLP got slightly impacted with an overall loss of 10 −1 and reaches the best quality of prediction.In Figure 10, we investigate the highest configuration 1024 × 1024.Similar conclusions for Figures 8 and 9 hold for this figure in terms of best model (MF for P u = 1W, P u = 10 −1 and MLP for P u = 10 −2 ).In addition, we aim to investigate the overall impact of varying the transmitted power.Thus, we track the log(N MSE) values while switching from one P u regime to another: In Figure 10 at each change of P u , MF is considerably impacted.To sum up, the choice of the optimal model strongly depends on the available complexity and the given transmitted power P u .In fact, MF, whether through BCD or BGD optimization, is the best model when the transmitted power is high (P u = 1W).In this case, BCDMF converges faster but has higher complexity than BGD.However, SGD for MF/N MF are the slowest models to converge but show negligible complexity.On the other hand, if we aim to prioritize run time, MLP exhibits the fastest predictions with good prediction error.Finally, it is wise to opt for MLP if the system is to operate under various transmitted power regimes where MLP offers good prediction quality for every P u value and the available complexity is medium.

Conclusions
In this paper, we proposed a blind Machine Learning-based Beam Alignment using Matrix Factorization, non-negative Matrix Factorization, and Multi-Layer Perceptron.We assumed an Uplink massive mmWave MIMO system using single RF-chains at UE and multiple RF-chains at BS though a fully analog architecture.The proposed approach consists in sounding the RSE of sub-sampled codebooks at UE and BS.The RSE of the non-sounded beams is predicted using MF, N MF, and MLP models.Our results show that, by sounding just 10% of the total beam pair samples, we may predict with high accuracy the unknown RSE values, which massively reduce the large signaling overhead of Exhaustive BA.Our future work investigates the scalability of our approach to a multi-user scenario.Robustness and ML-interpretability are other research directions for modeling industrial deployment.

Figure 1 .
Proposed BA diagram representation: (a) fully analog MIMO architecture using a single RF chain at UE and multiple RF chains at BS; (b) simplified illustration of Beam Alignment problem.
K} is the set of regularization hyperparameters used to balance the MF/N MF model, preventing any overfitting or underfitting.The Empirical Risk Minimization corresponding to the MF model is given by

Table 1 .
System parameters and hyperparameters.

Table 2 .
QoS minimum overhead required for MF/N MF for all proposed Pu regimes.
a |